My goal with this analysis is to find out whether Brazilian deputies have been using their reimbursement rights inadequately.
These reimbursements are called Quota for the Exercise of Parliamentary Activity and are “designated to pay for expenses exclusively linked to the exercise of parliamentary activity”. Therefore, as far as I could tell, there are two main ways we could detect an improper reimbursement claim (given the available data): * If the refund category is suspicious or * If the time component of the refund is suspicious
To investigate suspicious refund categories I tried plotting a bar chart of total refunds per category (of the top 7 categories). Much to my surprise, I had already found something very weird.
# Plot total value of refunds
description_summary %>%
filter(
row_number() > 10
) %>%
plot_ly(
x = ~refund_description,
y = ~refund_tot,
type = "bar",
color = ~refund_description,
colors = "Accent",
opacity = 0.70
) %>%
layout(
legend = list(
orientation = 'h',
traceorder = "reversed",
x = 0.5, y = -100
),
yaxis = list(
title = ""
),
xaxis = list(
title = "",
showticklabels = FALSE
)
)
Apparently “dissemination of parliamentary activity” is the category that has had the highest overall cost for the taxpayer: a total of R$48,645,429.54. Since this refund description is very vague, it seems to me that it is being widely used by the deputies as a cover up improper refunds.
Just to make sure I wasn’t being too quick to judge, I decided to look into this a little further and created a box-plot for the 7 categories with the highest mean refund values.
# Plot refund descriptions
deputies %>%
filter(
refund_description %in% d$refund_description[11:17]
) %>%
plot_ly(
x = ~refund_description,
y = ~refund_value,
type = "box",
color = ~refund_description,
colors = "Paired"
) %>%
layout(
legend = list(
orientation = 'h',
traceorder = "reversed",
x = 0.5, y = -100
),
yaxis = list(
title = ""
),
xaxis = list(
showticklabels = FALSE,
title = "",
mirror = TRUE
)
)
In the image above, dissemination of parliamentary activity doesn’t have the highest median, but it’s outliers stand out from the rest. If we examine to top outlier of this category (and of the whole plot), we find that it corresponds to R$184,500.00 being reimbursed for expenses at a small print shop, corroborating to the hypothesis that this category is in fact being misused by the deputies.
The next logical step would be doing some text mining on the company names to identify their types, and then check if they correspond with the assigned refund categories. For the sake of brevity, I won’t attempt to do this.
Moving on to analysing the time component of the reimbursements, I decided to get the average value of the refunds for each day of the year; with this we can tell if there is any time frame where there is an unusual number of reimbursement requests. For instance, finding out that there are too many refunds of weekends would be suspicious.
After creating loads of visualizations (which you can reproduce by running the code on my Kaggle github repo), I thought there was something weird with the plot below. It represents the density distribution of the mean value of refunds by day, see if you can guess what grabbed my attention…
# Plot density of mean amount reimbursed (day by day)
plot_ly(x = t$x, y = t$y, mode = "lines", fill = "tozeroy")
See all those lumps after R$1.5K? It means that there is a large but irregular number of days with average refund value higher than R$1.5K. If there is a time pattern to these outliers, this could mean there is something weird going on.
# Plot mean value of refunds
time_summary %>%
filter(refund_date > as_date("2016-01-01")) %>%
plot_ly(
x = ~refund_date,
y = ~refund_avg,
type = 'scatter',
mode = 'lines',
showlegend = FALSE
) %>%
layout(
yaxis = list(title = ""),
xaxis = list(title = "")
) %>%
add_trace(
x = c(as_date("2016-12-23")),
y = c(0, 10000),
mode = "lines"
)
And as expected, if we look at the time series of mean refund value per day, we can clearly see that the days with the highest averages are concentrated around late December and early January (especially after 23/Dec, represented by the orange line, which is the day the summer break starts).
In my opinion this time series shows that some deputies might be using reimbursements to pay for their vacations.
This wasn’t in any way an exhaustive report, but I think it is already able to supply enough evidence to the idea that Brazilian deputies are using public money for personal gain. In the beginning we set out to investigate two aspects of the dataset, and were able to find suspicious activity on both; there are probably many other ways to look at this data that I didn’t think of or didn’t have time for, so I highly encourage you to analyse it for yourself!
If you’d like to know more about this subject I suggest Serenata de Amor, an initiative created by Brazilian data scientists that uses AI to flag suspicious reimbursement claims. And I also have a blog where I post regularly about my data visualization and analyses, so if you liked this one there is a big chance you’ll like the rest!